Periodic backup
- Full backup every 4 hours; the last two backups are stored (default).
- To recover from a backup, raise a support ticket.
- Retention period can be configured.
- Multi-region write works differently from single-region write.
- If the database is deleted, there is a 30-day recovery option.
- Backup data is stored in Azure Blob Storage.
- GRS won't affect performance.
GRS: copy to paired regions (async)
ZRS: copy to three availability zones in the primary region (sync)
LRS: copy to a single physical location in the primary region (sync)
Configurable: backup interval / retention / storage redundancy
Data Factory / Cosmos DB data migration tool / Cosmos DB change feed / custom data migration app
Continuous backup
Data retention of up to 30 days in every region where the Cosmos DB account exists
-> If the Azure Cosmos DB account is using the strong consistency level, the backups taken in the write region could be more up to date than the backups taken in the read regions.
Monitor events & Create Alerts
Monitor using HTTP status codes; implement a circuit breaker / failsafe.
429 + 503 = transient errors
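A minimal sketch of handling 429 and 503 as transient errors, using retry with exponential backoff (a simpler technique than a full circuit breaker). `HttpError`, `call_with_retry`, and the `operation` callable are hypothetical names for illustration and are not part of any Cosmos DB SDK.

```python
import random
import time

TRANSIENT_STATUS_CODES = {429, 503}  # treated as transient, per the note above


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code, retry_after_s=None):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code
        self.retry_after_s = retry_after_s  # optional server-provided wait hint


def call_with_retry(operation, max_attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Run operation(); retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except HttpError as err:
            is_transient = err.status_code in TRANSIENT_STATUS_CODES
            if not is_transient or attempt == max_attempts:
                raise  # non-transient error, or retries exhausted
            # Prefer the server's hint; otherwise back off exponentially with jitter.
            delay = err.retry_after_s
            if delay is None:
                delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, base_delay_s)
            sleep(delay)
```

Non-transient codes such as 400 or 409 are re-raised immediately, since retrying them would not help.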
Replace document
To check 409 Conflict
- The id provided for the new document has been taken by an existing document.
400 Bad Request
- The request was specified with an incorrect SQL syntax or was missing required headers.
- The override set in x-ms-consistency-level is stronger than the one set during account creation. For example, if the consistency level is Session, the override can't be Strong or Bounded Staleness.
Get a Document
Dedicated gateway -> HTTPS
Direct -> TCP
Transient error 429: Review occurrences of this exception in the Azure Cosmos DB Insights report "Total Requests by Status Code" under the Requests tab. Check the percentage of 429 exceptions versus successful requests against your database.
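The percentage check can be sketched as below. This assumes the status-code counts have already been read off the Insights chart into a dict; `throttled_percentage` is a hypothetical helper, and measuring 429s against the sum of 429s and 2xx responses is one possible reading of "429 vs. successful".

```python
def throttled_percentage(counts_by_status):
    """Return 429s as a percentage of (429s + successful 2xx requests)."""
    throttled = counts_by_status.get(429, 0)
    successful = sum(n for code, n in counts_by_status.items() if 200 <= code < 300)
    total = throttled + successful
    if total == 0:
        return 0.0  # no relevant traffic, nothing to report
    return 100.0 * throttled / total


# Example with made-up counts: other error codes (here 404) are ignored.
sample = {200: 900, 201: 50, 429: 50, 404: 10}
print(throttled_percentage(sample))
```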
Steps to investigate rate limiting occurrences:
- Insights (Standard) Total Request by Status Code (verify if it's normal)
- Insights (Standard) Normalized RU Consumption (%) By PartitionKeyRangeID (hot partition)
- Change partition key strategy
- Increase throughput
- Investigate high-RU requests
- Metadata: the system allocates a rate-limit policy
If this type of request causes 429 exceptions, increasing the provisioned RU/s isn't recommended. There's a system-reserved RU limit for metadata requests.
- Transient (error by MS): raise a support ticket if it persists
Common alerting scenarios
The following are some scenarios where you can use alerts:
- When the keys of an Azure Cosmos account are updated.
- When the data or index usage of a container, database, or region exceeds a certain number of bytes.
- When the normalized RU/s consumption is greater than a certain percentage.
- When a region is added, removed, or goes offline.
- When a database or a container is created, deleted, or updated.
- When the throughput of your database or container is changed.
Control Plane Logs:
Modification commands to regions / account / settings
Data Plane logs:
CDBPartitionKeyStatistics
// Get the latest storage size for each logical partition key value
| summarize arg_max(TimeGenerated, *) by AccountName, DatabaseName, CollectionName, _ResourceId, PartitionKey
| extend utilizationOf20GBLogicalPartition = SizeKb / (20.0 * 1024.0 * 1024.0) // Current storage / 20GB
| project TimeGenerated, AccountName, DatabaseName, CollectionName, _ResourceId, PartitionKey, SizeKb, utilizationOf20GBLogicalPartition
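The utilization arithmetic in the query above can be checked by hand with a small sketch. The `PartitionKey` and `SizeKb` field names mirror the query columns; the `partitions_over` helper, the 0.8 threshold, and the sample rows are assumptions made only for illustration.

```python
LOGICAL_PARTITION_LIMIT_KB = 20.0 * 1024.0 * 1024.0  # 20 GB expressed in KB


def utilization_of_20gb(size_kb):
    """Fraction of the 20 GB logical partition limit used by size_kb."""
    return size_kb / LOGICAL_PARTITION_LIMIT_KB


def partitions_over(rows, threshold=0.8):
    """Return (partition key, utilization) pairs above threshold, largest first."""
    flagged = [
        (row["PartitionKey"], utilization_of_20gb(row["SizeKb"]))
        for row in rows
        if utilization_of_20gb(row["SizeKb"]) > threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up rows: a 10 GB partition and an 18 GB partition.
rows = [
    {"PartitionKey": "tenant-a", "SizeKb": 10 * 1024 * 1024},
    {"PartitionKey": "tenant-b", "SizeKb": 18 * 1024 * 1024},
]
print(partitions_over(rows))
```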
Implement security / Regional failovers
RBAC
global
Create resources
az cosmosdb: account info
az cosmosdb sql: database / container and subgroups
az cosmosdb sql database/container create
az cosmosdb sql database throughput update
--account-name '
--resource-group '
--name '
--throughput '4000'
az cosmosdb sql container throughput migrate
....
--throughput-type 'autoscale'
followed by
az cosmosdb sql container throughput update
....
--max-throughput '5000'
az cosmosdb update
....
--locations regionName='eastus' failoverPriority=0 isZoneRedundant=False
--locations regionName='westus2' failoverPriority=1 isZoneRedundant=False
az cosmosdb update
....
--enable-automatic-failover 'true'
az cosmosdb failover-priority-change
....
--failover-policies 'eastus=0' 'centralus=1' 'westus2=2'
az cosmosdb update
....
--enable-multiple-write-locations 'true'
az cosmosdb failover-priority-change
--name '
--resource-group '
--failover-policies 'westus2=0' 'eastus=1'
This will fail over to West US 2 (the new priority 0 region). Changing the priority of any region whose priority is != 0 will not trigger a failover.
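The rule stated above can be sketched as a small check. This only models the rule as written in these notes and makes no call to Azure; `triggers_failover` and the dict representation of failover policies are hypothetical.

```python
def write_region(policies):
    """Return the region whose failover priority is 0."""
    return next(region for region, priority in policies.items() if priority == 0)


def triggers_failover(current, requested):
    """True only if the priority-0 region differs between the two policy sets."""
    return write_region(current) != write_region(requested)


# Promoting westus2 to priority 0 changes the write region: failover.
print(triggers_failover({"eastus": 0, "westus2": 1}, {"westus2": 0, "eastus": 1}))

# Reordering only the non-zero priorities leaves the write region alone: no failover.
print(triggers_failover(
    {"eastus": 0, "centralus": 1, "westus2": 2},
    {"eastus": 0, "westus2": 1, "centralus": 2},
))
```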
Bicep
Microsoft.DocumentDB/databaseAccounts: represents an account
Microsoft.DocumentDB/databaseAccounts/sqlDatabases: represents a NoSQL API database
Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers: represents a NoSQL API container
Cosmos DB Monitor -> Insights
- See status code percentages
- Get normalized RU consumption per partition key range
Insights > System
- Get metadata requests that exceed capacity / by status code
activity and platform metrics are collected automatically
Operational data
The NoSQL API log tables are:
- DataPlaneRequests (CRUD): logs back-end requests for operations that create, update, delete, or retrieve data.
- QueryRuntimeStatistics: logs query operations against the NoSQL API account.
- PartitionKeyStatistics: logs logical partition key statistics in estimated KB. Helpful when troubleshooting skewed storage.
- PartitionKeyRUConsumption: logs the per-second aggregated RU/s consumption of partition keys. Helpful when troubleshooting hot partitions.
- ControlPlaneRequests: logs Azure Cosmos DB account control data, for example adding or removing regions in the replication settings.
CDBDataPlaneRequests
| where TimeGenerated >= ago(1h)
| summarize OperationCount = count(), TotalRequestCharged = sum(todouble(RequestCharge)) by OperationName
| order by TotalRequestCharged desc
Shows which operations are causing the hot partition (throttling), using Azure Diagnostics logs:
AzureDiagnostics
| where TimeGenerated >= ago(24h)
| where Category == "DataPlaneRequests"
| summarize throttledOperations = dcountif(activityId_g, statusCode_s == 429), totalOperations = dcount(activityId_g), totalConsumedRUPerMinute = sum(todouble(requestCharge_s)) by databaseName_s, collectionName_s, OperationName, requestResourceType_s, bin(TimeGenerated, 1min)
| extend averageRUPerOperation = 1.0 * totalConsumedRUPerMinute / totalOperations
| extend fractionOf429s = 1.0 * throttledOperations / totalOperations
| order by fractionOf429s desc
Use diagnostic logging to investigate generic 403 Forbidden errors: check the incoming IP address against the IP settings.
Azure resource logs:
- Enable Azure resource logs for Cosmos DB. You can use Microsoft Defender for Cloud and Azure Policy to enable resource logs and log data collection. These logs can be critical for investigating security incidents and doing forensic exercises.
- Enable control plane auditing under Diagnostic settings.
- Get an alert when the firewall rules for your Azure Cosmos account are modified. The alert is needed to find unauthorized modifications to the rules that govern the network security of the account, so you can take quick action.
To review:
- Max partition throughput values